The reliability of radiologists' first impression interpreting a screening mammogram

The initial impressions of some radiologists about the presence of an abnormality (the gist signal) are as accurate as decisions made under normal presentation conditions, while the performance of others is only slightly better than chance. This study investigates whether there is a subset of radiologists (i.e., “super-gisters”) whose gist signal is more reliable and consistently more accurate than that of others. To measure the gist signal, images were presented for less than half a second. We collected the gist signals from thirty-nine radiologists, who assessed 160 mammograms twice with a wash-out period of one month. Readers were categorized as “super-gisters” or “others” by fitting a two-component Gaussian mixture model to each radiologist's Area Under the Receiver Operating Characteristic curve (AUC) averaged over the two rounds. The median intra-class correlation (ICC) for the “super-gisters” was 0.63 (IQR: 0.51-0.691), while the median ICC for the “others” was 0.51 (IQR: 0.42-0.59). The difference between the two groups was significant (p=0.015). The number of mammograms interpreted per week did not differ significantly between “super-gisters” and “others” (medians of 237 versus 200, p=0.336). A linear mixed model, which treated both case and reader as random effects, showed that only “super-gisters” can perceive the gist of the abnormal on negative prior mammograms from women who later developed breast cancer. Although the gist signal is noisy, a subset of readers has a superior capability for detecting the gist of the abnormal, and only the scores given by these readers are useful and reliable for predicting future breast cancer.


Introduction
Previous studies have shown that radiologists are able to detect the gist of the abnormal on chest x-rays [1] or mammograms [2] at an above-chance level based on an image presentation lasting less than a second. This observation led to the "global-focal" search model, which suggests that the gist signal arises from a local source showing a deviation from normality and guides the observer to the suspicious locations [3]. However, recent studies indicated that experts could detect the gist of the abnormal in the normal breast contralateral to a cancer [4,5] and in prior normal mammograms from women who were diagnosed with breast cancer in a subsequent screening round [6], suggesting potential for breast cancer risk prediction [7]. In these studies, a mammogram is usually presented for a half-second and the radiologist's first impression about the abnormality of a case (the gist signal) is recorded. The important information captured by the gist signal for identifying malignant cases could normally be incorrectly ignored or overruled following more detailed fixations on the images [8].
In previous studies, the radiologists' performances were analyzed as a single group to show above-chance performance in differentiating normal from abnormal images. A closer look at the performance of each individual radiologist indicates considerable variation in the gist experiments [9]. Such an observation is not unexpected, as inter-reader variation also exists under the usual viewing condition [10,11]. However, the magnitude of the variability is more notable with rapid image presentation (i.e., in the gist experiment) [9]. To provide a recommendation to radiologists about whether they should trust their initial impression, or to utilize the gist signal for improving breast cancer detection [12] or identifying high-risk individuals [8], we need to establish how reliable the gist signal is and whether there is a subset of radiologists (gist experts or "super-gisters") whose gist responses are more reliable and accurate than those of others.
Based on the previous studies [3-8], it is unknown whether a reader identified as a "super-gister" in a single experiment is actually better at detecting the gist of the abnormal or whether that result is a random outcome; consequently, no recommendation can be made to a specific observer about the accuracy of their initial impression. The magnitude of the intra-reader variability of the gist signal is also unknown. In one of our earlier experiments, in addition to the gist performance, the radiologists' usual diagnostic performance, as measured by the area under the receiver operating characteristic curve (AUC), was recorded under the usual viewing condition. The results showed that the two AUC values were not correlated [9]. Therefore, "super-gisters" did not necessarily outperform other radiologists under the usual viewing and reporting condition.
In this study, by conducting the gist experiment twice (with a washout period of at least one month), we explored: (1) whether the radiologists' gist performances in the two rounds were related;
(2) whether a subset of "super-gisters" exists and, if so, for that subset: (2-1) whether interpretive volume (the number of mammograms interpreted by the radiologist each week) and years of experience reading mammograms can predict who they are; (2-2) whether the test-retest reliability of their gist signal was higher than that of other radiologists; (2-3) whether all readers are capable of perceiving gist information related to predicting a future breast cancer or only "super-gisters" have this capability.
As stated, the gist of the abnormal can indicate the presence of a current breast cancer [10] as well as an elevated risk of a future breast cancer [11]. However, our earlier studies showed that the strength of the gist signal is weaker in the latter case, i.e., when no explicit actionable lesion is present but the mammogram was acquired from a woman who would eventually develop breast cancer.

Experimental procedure to collect gist responses
We used a multi-observer experimental protocol to answer the study questions. The study protocol (Figure 1) is described extensively elsewhere [6]. Briefly, to record the gist signal, a red cross first appeared for a half-second at the center of the display to ensure that participants fixated there. Then, the mammogram was shown for a half-second. After the mammogram, a mask corresponding to the breast area was presented to interrupt the visual processing of the mammogram. To record the gist response, a sliding bar was shown on which the reader indicated the probability of the case being abnormal on a scale of 0 to 100. The reader had unlimited time at this stage.

Figure 1-The experimental protocol for recording gist responses.
Using the above-mentioned experimental procedure, 39 radiologists and breast physicians read a set of mammography cases twice, i.e., in Round 1 and Round 2. The time interval between the two rounds was at least one month to reduce recall bias. In each round, the participants were asked to provide gist responses to 160 CC mammograms, presented in random order. These mammograms were from screening examinations and were retrospectively retrieved from the Breast Screen Reader Assessment Strategy (BREAST) data bank [12]. We included four categories of images in the experiment, and each category contained 40 images. The "Normal" category comprised mammograms from women who had normal radiographic appearances at the time of examination and remained cancer-free in the next round of screening. The mammograms in the "Cancer" category contained subtle biopsy-proven malignancies. Finally, the "Prior_Vis" and "Prior_Invis" categories comprised prior images, with and without overt cancer signs respectively, from women who were diagnosed at a subsequent screening round. To assign the prior mammograms to these two categories, two experienced radiologists were asked to assess the images based on the pathology reports.

Identifying "super-gisters"
In each session (i.e., data from one observer in one round), the AUC values for discriminating "Cancer" from "Normal", "Prior_Vis" from "Normal", and "Prior_Invis" from "Normal" were calculated. The readers were categorized as "super-gisters" or "others" based on their AUC values. To categorize the participants, their AUC values in the two rounds were averaged and a Gaussian mixture model with two components was fitted to these average AUC values. The threshold for categorization was derived from $\mu_1$, $\mu_2$, $\sigma_1$, and $\sigma_2$, where $\mu_1$ and $\mu_2$ are the mean values of the two components and $\sigma_1$ and $\sigma_2$ are their standard deviations. Once the model was fitted to the AUC values, one-third of our participants were categorized as "super-gisters".
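The categorization step can be illustrated with a minimal sketch. It fits a one-dimensional, two-component Gaussian mixture with a plain EM loop and then assigns each reader to the more probable component. This assignment rule is an assumption made for illustration and is not necessarily the threshold used in the study; the function names and the AUC values are hypothetical and are not study data.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_component_gmm(values, n_iter=200, min_sd=1e-3):
    """Fit a 1-D, two-component Gaussian mixture with plain EM.

    Returns (weights, means, sds), each a list of length two.
    """
    xs = sorted(values)
    half = len(xs) // 2
    # Initialise the means from the lower and upper halves of the sorted data.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    overall_mean = sum(xs) / len(xs)
    overall_sd = math.sqrt(sum((x - overall_mean) ** 2 for x in xs) / len(xs))
    sds = [max(overall_sd, min_sd)] * 2
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in values:
            p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: update weights, means, and standard deviations.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), min_sd)
    return weights, means, sds


def label_super_gisters(values):
    """Flag a reader when the higher-mean component is the more probable one (assumed rule)."""
    weights, means, sds = fit_two_component_gmm(values)
    hi = 0 if means[0] > means[1] else 1
    lo = 1 - hi
    flags = []
    for x in values:
        p_hi = weights[hi] * normal_pdf(x, means[hi], sds[hi])
        p_lo = weights[lo] * normal_pdf(x, means[lo], sds[lo])
        flags.append(p_hi > p_lo)
    return flags


# Hypothetical average AUCs for 39 readers: 26 lower values and 13 higher values.
mean_aucs = [0.55 + 0.004 * i for i in range(26)] + [0.76 + 0.006 * i for i in range(13)]
flags = label_super_gisters(mean_aucs)
print(sum(flags), "of", len(flags), "readers flagged")
```

With well-separated groups such as these, the posterior rule and any reasonable threshold between the two component means give the same labels; the two can differ when the components overlap.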

Statistical analysis
This section details the statistical analyses used to answer each of the study questions.

Exploring whether the radiologists' gist performances in two rounds were related
As stated, three AUC values were calculated for each session, one per image category (discriminating "Cancer" from "Normal", "Prior_Vis" from "Normal", and "Prior_Invis" from "Normal"). For each reader, we also averaged the gist responses from the two rounds and calculated the AUC values for these new gist scores. Averaging the scores from the two rounds could help cancel out noise to some extent. Therefore, nine AUCs were available for each reader: (2 rounds + average of the two rounds) × 3 categories. We calculated Pearson's correlation coefficient between all possible pairs of AUCs.
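As a minimal sketch of this computation, the AUC can be obtained as the proportion of abnormal-normal case pairs in which the abnormal case receives the higher score (ties counted as one half), and Pearson's coefficient can then be computed across readers. The function names and the toy scores are hypothetical and are not study data.

```python
import math


def auc(abnormal_scores, normal_scores):
    """AUC as the probability that an abnormal case outscores a normal one (ties count 0.5)."""
    wins = 0.0
    for a in abnormal_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(abnormal_scores) * len(normal_scores))


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def reader_aucs(round1, round2):
    """Return the nine AUCs for one reader.

    round1 and round2 map a category name to that reader's gist scores (0 to 100),
    with the cases listed in the same order in both rounds.
    """
    averaged = {cat: [(a + b) / 2 for a, b in zip(round1[cat], round2[cat])] for cat in round1}
    out = {}
    for label, scores in (("round1", round1), ("round2", round2), ("average", averaged)):
        for cat in ("Cancer", "Prior_Vis", "Prior_Invis"):
            out[(label, cat)] = auc(scores[cat], scores["Normal"])
    return out


# Toy scores for a single hypothetical reader (three cases per category).
round1 = {"Normal": [10, 20, 30], "Cancer": [40, 25, 60],
          "Prior_Vis": [15, 35, 50], "Prior_Invis": [5, 25, 45]}
round2 = {"Normal": [20, 10, 40], "Cancer": [55, 30, 70],
          "Prior_Vis": [25, 30, 45], "Prior_Invis": [15, 20, 35]}
aucs = reader_aucs(round1, round2)
```

To correlate a pair of AUCs, collect each of the two AUCs across all readers into a list and pass the two lists to `pearson`.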

Characteristics of "super-gisters" and their test-retest reliability
To explore the characteristics of the "super-gisters", we used the Mann-Whitney U test to examine whether interpretive volume and years of experience differed between "super-gisters" and other readers.
We calculated the intra-class correlation (ICC) between the scores given in the two rounds. Using the Mann-Whitney U test, we explored whether the ICC values differed significantly between "super-gisters" and "others".
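The two steps above can be sketched as follows. The form of the ICC is not specified here, so the sketch assumes ICC(2,1) (two-way random effects, absolute agreement, single measurement) with the two rounds as the repeated measurements. The Mann-Whitney U test uses a normal approximation without tie or continuity correction, which is also an assumption. The function names and scores are hypothetical.

```python
import math


def icc_2_1(round1, round2):
    """ICC(2,1) for one reader: cases are the targets, the two rounds are the measurements."""
    n, k = len(round1), 2
    rows = list(zip(round1, round2))
    grand = (sum(round1) + sum(round2)) / (n * k)
    row_means = [(a + b) / k for a, b in rows]
    col_means = [sum(round1) / n, sum(round2) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in rows for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-case mean square
    msc = ss_cols / (k - 1)               # between-round mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via a normal approximation (no tie or continuity correction)."""
    n1, n2 = len(x), len(y)
    u1 = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u, p


# Hypothetical gist scores of one reader for four cases in two rounds.
scores_round1 = [10, 50, 90, 30]
scores_round2 = [20, 60, 100, 40]
reader_icc = icc_2_1(scores_round1, scores_round2)
```

In this toy example the second round is the first round shifted by a constant, so the ranking of cases is identical but the ICC stays below 1, because absolute agreement penalizes the systematic shift. To compare the groups, compute one ICC per reader and pass the two lists of ICCs to `mann_whitney_u`.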

How case category and reader characteristics affect the gist responses
To explore how case category and reader characteristics affect the gist responses, a linear mixed model was built. It treats the case category and the interaction between the case category and reader characteristics as fixed effects, while considering readers and cases as random effects. The included reader characteristics were whether a reader is a "super-gister", interpretive volume, number of biopsies per year, and diagnostic focus (i.e., the percentage of time dedicated to reading diagnostic mammograms). We entered the latter three variables into the model since they exhibited a significant (although weak) correlation with the performance for Cancer vs Normal categorization. In the previous studies [4-6, 11, 13, 14], a Wilcoxon signed-rank test based on the readers' AUC estimates was used to show whether the median AUC is significantly different from chance level. This test treats the cases as fixed; its conclusions therefore apply to the population of readers, but assuming fixed cases limits the scope of the conclusions to the sampled cases. The linear mixed model, in contrast, treated both readers and cases as random effects.
Table 1 shows the Pearson correlation coefficients. Other than the AUC for the Prior_Invis (Round 1 and 2) versus Normal in the first round, all correlation coefficients were 0.50 or more. When the signal was averaged and the AUC values were then calculated, the AUC for the Prior_Invis category was above 0.50. Such averaging cancels out part of the noise of the gist signal.

Characteristics of "super-gisters" and their test-retest reliability
The median number of cases read per week and the median number of years reading mammograms were 237 cases and 21 years, respectively, for the "super-gisters", and 200 cases and 18 years for the other radiologists. Neither the number of cases read per week nor the number of years reading mammograms differed significantly between the two groups (p-values of 0.336 and 0.615, respectively).
We also explored whether the test-retest reliability of the "super-gisters'" gist signal was higher than that of the other participants. The median ICC for the "super-gisters" was 0.63 (IQR: 0.51-0.691), while the median ICC for the "others" was 0.51 (IQR: 0.42-0.59). The difference between the two groups was significant (p=0.015).
Figures 2, 3, and 4 show waterfall plots of the coefficients of the linear mixed model fitted to the data from the second round, relating the gist responses to the case category and to the interaction terms between case category and reader characteristics. All categories are expressed relative to Normal, which serves as the baseline. Therefore, an average reader would assign 27.93 (i.e., the intercept of the model) to an average "Normal" case.
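To make the role of the intercept and the interaction terms concrete, a model of this kind can be written as follows. This is a sketch under stated assumptions rather than the exact specification that was fitted: the symbols are introduced here for illustration, with $y_{rc}$ the gist score of reader $r$ for case $c$, $k$ ranging over the Cancer, Prior_Vis, and Prior_Invis categories (Normal is the baseline), $x_{rm}$ the $m$-th characteristic of reader $r$, and $u_r$ and $v_c$ the random effects of reader and case.

```latex
y_{rc} = \beta_0
       + \sum_{k} \beta_k \, \mathbb{1}[\mathrm{cat}(c) = k]
       + \sum_{k} \sum_{m} \gamma_{km} \, \mathbb{1}[\mathrm{cat}(c) = k] \, x_{rm}
       + u_r + v_c + \varepsilon_{rc},
\qquad
u_r \sim \mathcal{N}(0, \sigma_u^2), \quad
v_c \sim \mathcal{N}(0, \sigma_v^2), \quad
\varepsilon_{rc} \sim \mathcal{N}(0, \sigma_\varepsilon^2)
```

Under this form, $\beta_0$ is the expected score for a Normal case from an average reader (the reported intercept of 27.93), $\beta_k$ is the shift associated with category $k$, and $\gamma_{km}$ corresponds to an interaction term such as "Cancer|Super-gister".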

How case category and reader characteristics affect the gist responses
In Figure 2, the interaction term "Cancer|Super-gister" denotes the interaction between the cancer category and the super-gister flag (0 or 1) based on the first round. The other interaction terms in Figures 2-4 are likewise written using "|". All coefficients except "Cancer|Diagnostic focus" were significant. For the interaction term "Cancer|Weekly case", the weekly number of cases was discretized into four scores representing its four quartiles. Similarly, the number of biopsy examinations in the last 12 months and the diagnostic focus were discretized into quartiles. As shown in Figure 2, the interaction term with the largest effect on the gist response is "Cancer|Super-gister". Similarly, Figures 3 and 4 show the coefficients of the linear mixed model associated with the "Prior_Vis" and "Prior_Invis" categories. As shown, parameters such as interpretive volume, number of biopsies, or diagnostic focus do not significantly impact the gist score provided to an

Conclusion
In our earlier studies, we showed that the average gist signal can be used to predict a future breast cancer [7] as well as to improve the performance of both radiologists and a deep-learning model in breast cancer detection [10,15]. The average gist signal was defined as the mean of the gist responses assigned to a case by a group of radiologists. The averaging operation in these studies cancelled out the noise to some extent. The current study shows that the gist signal is noisy, with relatively low levels of inter- and intra-reader agreement. However, a subset of readers (i.e., "super-gisters") has a superior capability, and only the scores given by these readers are useful for predicting a future breast cancer (based on the linear mixed modelling results). The "super-gisters" also exhibited higher intra-reader agreement in the gist experiment. At present, gist experts are identified by their performance on the gist task. The results of the linear mixed modelling imply that, when treating both readers and cases as random effects: (1) only super-gisters could detect the gist of the abnormal on Prior_Invis images; (2) for Prior_Vis, all readers can detect the gist of the abnormal, but being a super-gister roughly doubles the strength of the perceived signal; (3) for cancer cases, the case category is the main driver of the higher observed gist responses, although super-gisters and those with a higher reading volume perceived a stronger gist of the abnormal on cancer cases.